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Abstract 

Composite likelihood estimation has an important role in the analysis of multi¬ 
variate data for which the full likelihood function is intractable. An important issue 
in composite likelihood inference is the choice of the weights associated with lower¬ 
dimensional data sub-sets, since the presence of incompatible sub-models can deterio¬ 
rate the accuracy of the resulting estimator. In this paper, we introduce a new approach 
for simultaneous parameter estimation by tilting, or re-weighting, each sub-likelihood 
component called discriminative composite likelihood estimation (D-McLE). The data- 
adaptive weights maximize the composite likelihood function, subject to moving a given 
distance from uniform weights; then, the resulting weights can be used to rank lower¬ 
dimensional likelihoods in terms of their influence in the composite likelihood function. 

Our analytical hndings and numerical examples support the stability of the resulting 
estimator compared to estimators constructed using standard composition strategies 
based on uniform weights. The properties of the new method are illustrated through 
simulated data and real spatial data on multivariate precipitation extremes. 
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1 Introduction 


While likelihood-based inference is central to modern statistics, for many multivariate prob¬ 
lems the full likelihood function is impossible to specify or its evaluation involves a prohibitive 
computational cost. These limitations have motivated the development of composite likeli¬ 
hood approaches, which avoid the full likelihood by compounding a set of low-dimensional 
likelihoods into a surrogate criterion function. Composite likelihood inference have proved 
useful in a number of helds, including geo-stat istics, analy s is of s patial extremes, statistical 


genetics, and longitudinal data analysis. See 


Varin et al 


(120111) for a comp r ehens ive sur- 


Larribe and FearnheadI (120111 ) review 


vey of composite likelihood theory and applications, 
several applications in genetics. 

Let X be a d X 1 random vector and f{x\6) be the assumed density model for X, in¬ 
dexed by the parameter d G 0 C M^, p > 1. Suppose that the full likelihood function, 
L{9\x) oc f{x\9), is difficult to specify or compute, but we can specify low-dimensional dis¬ 
tributions with one, two, or more variables. Specihcally, let {Yj,j = 1,... ,m} be a set of 
marginal or conditional low-dimensional variables constructed from X with associated like¬ 
lihoods Lj{9\yj) oc fj{yj\9), where fj{-\9), 9 G 0 denotes the jth low-dimensional density 
model for Yj. The low-dimensional variables {Yj} are user-dehned and could be constructed 
by taking marginal models, like Xi,, X^, pairs like (Xi, X 2 ), or conditional variables like 
(Xi, X 2 )|X 2 . The overall struct ure of such lower-dim ensional models is sometimes referred as 


to composite likelihood design (iLindsav et al. 


20111) and its choice is often driven by compu¬ 


tational convenience. For example, if X follows a d-variate normal distribution Nd{0, S), the 
full likelihood is hard to compute when d is large due to inversion of S, which involves 0{d^) 
operations. In contrast, using sub-models for variable pairs (Xfc,Xfc/), 1 < k < k' < d, can 
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reduce the computational burden since it involves simply inverting 2x2 partial covariance 
matrices. 

Following iLindsayi (119881) . we define the composite likelihood function by 


CL(9Ka:) = n/jte|9r'. 

i=i 


( 1 ) 


where {wj^j = 1,... ,m} are non-negative weights, possibly depending on 9. A well-known 
issue in composite likelihood estimation is the selection of the weights, as their specification 
plays a crucial role in determi n ing both efficiency and re l iability of the r esulting composite 


likeli h ood estimato r (iLindsavl . 


2011 


Xu and Reidl. 


1988 


Joe and heel . 


2009 


Cox and Reidl . 


20041 : 


Varinetah 


201 ih . Despite the importanc e of the weigh t s, ma ny statistical and 


computational challenges still hinder their selection (iLindsay et al 


20n|) 


This paper is concerned with the aspect of stability of composite likelihood selection. 
Stability occurs when the maximizer of the overall composite likelihood function L{9\w) 
is not overly affected by the existence of locally optimal parameters that work only for a 
relatively small portion of such sub-sets, say Fi,..., fn* < n7./2. The presence of such 
local optima arises from the incompatibility between the assumed full-likelihood model and 
the m* lower dimensional models. For example, suppose that the true distribution of X is a 
d-variate normal distribution with zero mean vector, unit variance and correlations 2po for all 
variable pairs, while the true correlation is po for some small fraction of the d{d — l)/2 pairs. 
If one mistakenly assumes that all correlations are equal to po, both maximum likelihood 
and pair-wise likelihood estimators with uniform weight, Wj = 1/m, j = 1,... ,m, are not 
consistent for pn in this situation. Other examples of incompatible models are given in 


Xu and Reidl (1201 ll) . In applications, model compatibility is hard to detect, especially when 


m is large, so incompatible sub-models are often included in the composite likelihood function 
with detrimental effects on the accuracy of the global composite likelihood estimator. 

Motivated by the above issues, we introduce the discriminative maximum composite 
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likelihood estimator (D-McLE), a new methodology for reliable likelihood composition and 
simultaneous parameter estimation. The new approach computes smooth weights by max¬ 
imizing the composite likelihood function for a sample of observations subject to moving a 
given distance, say from uniform weights. The D-McLE is regarded as a generalization 
of the traditional McLE. If .^ = 0 the D-McLE is exactly the common composite likelihood 
estimator with uniform weights. When ^ > 0, incompatible sub-models are down-weighted, 
thus resulting in estimators for 6 with bounded worst-case bias. Our analytical Endings 
and simulations support the validity of the proposed method compared to classic composite 
likelihood estimators with uniform weights. The new framework is illustrated through esti¬ 
mation of max-stable models, which have prov ed useful for describin g extreme environmental 


occurrences as hurricanes, floods and storms (jPavison et al.l. 


2ni2h . 


The proposed procure would be useful in two respects. First, the resulting weights 
would be a valuable diagnostic tool for composite likelihood selection. Small weights would 
signal suspicious models, which could be further examined leading to improved assumptions. 
Conversely, the method can be employed to identify influential data sub-sets for many types 
of composite likelihood estimators. Second, the estimates obtained by such method would 
be trustworthy at least for the bulk of the data sub-sets models (which are compatible with 
model assumptions). Clearly, assigning the same weight to all the models including the ones 
in strong disagreement with the majority of data would lead to biased global estimates, 
which can be an untrustworthy representations of the entire data-set. 

The proposed method is a type of data tilting, a general technique which involves re¬ 
placing uniform weights with more general weights. To our knowledge, this is the first 
work that introduces tilting for lower-dimensional data sub-sets within the composite like¬ 
lihood framework. In robust statistics, tilting has been typically employed to robustify 
parametric estim ating equations, or to obtain natural data order in terms of their influence 


Choi et al 


(1200(1 . Tilting h as also been used 


ence of data-subsets; e.g., see 


o obta i n measures of outlyingness ari d influ- 


Hall and Presnell 

(19991 

Critchlev and Marriott 

to 

o 

o 

_ 

Lazar 
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feoosl); 


Camponovo and Otsii fl2012i) . 


Genton and Halil (120141) use a tilting approach in the 


context of multivariate functional data to ranking influence of data subsets. 

The rest the paper is organized as follows. In Section [2l we describe the new methodology 
for simultaneous likelihood selection/estimation; we give an efficient algorithm and introduce 
the compatibility plot, a new graphical tool to assess the adequacy of the sub-models. In 
Section El we study the properties of the new estimator and give its limit distribution. 
In Section 01 we provide simulated examples in hnite samples conhrming our theoretical 
hndings. In Section [51 we illustrate the new procedure to the Tasmanian rainfall spatial 
data on multivariate precipitation extremes. In Section |6l we conclude and discuss possible 
extensions for m —)■ oo. Proofs of technical results are deferred to a separate appendix. 


2 Methodology 

2.1 Composite likelihood selection 

Given independent observations ..., from the true distribution G{x), we construct 
the set of marginal or conditional low-dimensional observations Yj^\ ..., Yj^\ j = 1,... ,m, 
and dehne the weighted composite log-likelihood function 

m m n 

U0\w) = ^ V»,(9) = E ^ A(d'’|9). (2) 

j=l j=l i=l 

where w = (wi,... ,Wm)^ G [0,1]™' are constants playing the role of importance weights. 
The weight Wj characterizes the impact of the jth sub-likelihood, 

n 

inj{9) =n-^'^\ogfj{yj\9), 
i=l 

on the overall composite likelihood function in{9\w). We dehne incompatibility by assuming 
there is a global parameter, say 6*o G 0, which suits most sub-models. Specihcally, we assume 
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partial models Yj /?•(%■ I where 9j ^ Oq if j < m* < m/2 (incompatible models) and 
9j = 9q, a m* < j < m (compatible models). 

Next, we introduce the D-McLE procedure for simultaneous discrimination of discordant 
models and parameter estimation. We propose to select the weight Wj to be small when, 
for a value of 9 that is appropriate for the majority of the data sub-sets, the sub-likelihood 
function for the jth data sub-set, inj{9), is small. To this end, w is regarded as a discrete 
distribution on m points and the discrepancy between w and the uniform distribution Wunif = 
{1/m,... ,1/m) is measured by the Kullback-Leibler divergence 

m 

DKL{w,Wuni /) - Wj\og{mWj), (3) 

i=i 

where 0 < DKL{w,Wunif) < logm. For a given parameter 9, data-dependent weights Wn = 
Wn{9) are then chosen by solving the following program 

m 

max{4(6'|w)} , s.t.: T)xl(w, w«m/) = Wj = 1- (4) 

Finally, the D-McLE, denoted by 6* = 6*^, is then dehned as the maximizer of the composite 
log-likelihood function 

L{9) = in{9\Wn{9)) 

where Wn{9) = {wni{9),... ,Wnm{9))'^ is the vector of data-dependent weights. Equivalently, 
9^ can be obtained by computing the prohled estimator 9{w) by maximizing in{9\w) for a 
given weight and then solve (jl]) with 9 = 9{w). 

The composite likelihood estimator 9^ entails moving away from uniform weights in the 
direction that emphasizes the contribution of the most useful data sub-sets. If ^ >0, the 
relative importance of the sub-likelihoods that are incompatible with the data is diminished 
in the composite likelihood equation ([2]). The special case when = 0 corresponds to the 
composite likelihood estimator with uniform weights w = Wunif- Thus, all the data sub-sets 
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are regarded as equally compatible. Other divergence measures may be considered in place 
of the Kullback-Leibler divergence (jS]), which could be useful in particular estimation setups, 
although these are not pursued in this paper. The Kullback-Leibler divergence, however, has 
the advantage that allows one or more zero weights, and gives automatically nonnegative 
wights without imposing additional constraints by some algorithm to ensure this property. 
For example, when m is very large it could be useful to modify Dkl{w) to promote sparsity, 
i.e. select relatively a large number weights that are exactly zero. 


2.2 Data-adaptive weights and parameter estimation 

The program in (|1]) is solved by maximizing the Lagrangian function 

m / m \ 

h{w,Xi,X 2 \ 6 ) = '^Wj£nj{0) + Ai {DKL{w,Wunif) - O + -^2 I - 1 j , (5) 

j=l \j=l J 

where Ai and A 2 are Lagrange multipliers. It is easy to see that the solution to ([5]) has the 
form 

Wnj{9) = 02 exp{aiinj{9)}, j = ( 6 ) 


where ai and 02 depend on the Lagrange multipliers Ai and A 2 . From the two constraints 
in (jl]), «! = ai{9) and 02 = 02 ( 6 *) are obtained by solving 


^ E JLi exp{ai4j (9) }inj (9) 

Er.iexp{ai<„,(«)} 


log^exp{ai4j(6')} + logm, 
j=i 


(7) 


and 02 = 1/ J2JXiexp{ai£nj(9)}- The D-McLE 9^ is then computed by maximizing £n(9) = 
£n(9jvJn(9)). 

Lemma 1 in the appendix shows that computing the D-McLE, 9^, is equivalent to solving 
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the estimating equations 


UniO) = VeiniO) = '^WnjiO)UnjiO) = 0, 

i=i 


( 8 ) 


where UnjiO) = X]r=idenotes the partial score function corresponding to 
the jth data subset. Thus, UniO) is a weighted estimating equation involving the partial 
scores with weights depending on the data and 6 . A small weight Wnj implies a modest 
contribution of the jth score, Unj, to the overall composite likelihood equation. The constant 
^ is regarded as a stability parameter which can be used to control for the relative impact of 
the incompatible lower-dimensional likelihoods. Particularly, if ^ is large incompatible models 
will receive a low weight, with a relatively small effect on the hnal parameter estimates. If 
^ = 0, all the sub-models are treated equally in terms of the impact of corresponding sub- 
likelihoods in Un{ 0 ). 

Equation ([8]) highlights the resemblance to estimating functions of classic robust M- 
estimators, whose main aim is to reduce the influence of outliers in the full likelihood func- 
tio n. Indeed, the ap proach followed here coincides with the robust estimation approach 


by 


Choi et ah 


(1200011 in the particular case where: n = 1, Yi,... ,Ym are independent and 


all sub-models fj, j = l,...,m are all identica 
eral, however, the D-McLE is very different from 


to the fu ll likel ihood model, /. In gen- 


Choi et al 


(1200011 and other similar robust 


methods. The main difference is that the weights {wnj} in (|6]) refer to variables Yi,..., Y^, 
which are constructed by taking sub-sets of the original vector X and are possibly correlated; 
in robust M-estimation weights refer to independent observations on the original vector X. 
Thus, in our approach n observations corresponding to the jth data sub-set, namely Yj^\ 
i = 1,... ,n, receive the same weight, Wjn- This reflects our need to control for the incom¬ 
patibility of a portion of the sub-models, say fi,..., /m*, m* < m, rather than reducing the 
effect of outlying observations with respect to the full model /. 












2.3 Computing 


The form of equation ([8]) suggests a simple algorithm to simultaneously compute weights and 
parameter estimates. At each step of the algorithm, we update weights based on previous 
parameter estimates and then compute a fresh parameter estimate using the new weights. 
Starting from an initial estimate, we compute: 


§{t) 


9: 0 = ^ 




u. 


nj 


( 0 ) 


t > 1, 




(9) 


until convergence is reached. We consider a relative convergence criterion on the weights 


and stop iterating when — Wnf’ 


W| 

aj I 


Wn^'W < e, where e > 0 is some tolerance level. 


A practical advantageis that ([9]) is easy to implement when a basic composite likelihood 
estimator with fixed weights is already available. 

In our numerical studies, the algorithm gave satisfactory performances. In all our ex¬ 
amples convergence was reached in a few iterations and we noted that the computational 
cost does not increase much as m grows. This behavior makes the proposed algorithm well- 
suited to high-dimensional problems with a large number of sub-likelihoods and is shared by 


analogo us itera 


(e.g. see 


Arslan 


i vely 


re-weighted algorithms for M-estimation with well-established theory 


(120041) 1. Although we do not offer theoretical insight on the general theoret¬ 
ical behaviou r of our algorithm, conve rgence results may be derived following an argument 


analogous to 


Basu and Lindsavl (120041) in the context of iteratively reweighted procedures 


for minimum divergence estimators. 


2.4 Compatibility profile plots (CPPs) 

Let n(^) = {pi,... ,pm) be the arrangement of indices {!,..., m} implied by Wnp^{6^) < 
■ ■ ■ < Wnp^{d^), where Wnj{d^), j = 1,... ,n, are data-dependent weights computed by the 
algorithm in Section 12.31 The ordering 11 (,^) induces an importance ranking for the sub¬ 
models in terms of their compatibility with the true distribution generating the data. Based 
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on this ranking, a graphical tool is introduced, called a compatibility prohle plot (CPP). 
The CPP traces the htted weights, Wnj{0^), j = 1,... ,m, as ^ moves away from zero and 
can be used to inspect the compatibility of individual sub-likelihoods. For instance, a sharp 
decrease of the hrst m* weights from uniform weights Wunif = (1/m,..., 1/m), suggests that 
the hrst m* sub-likelihoods are likely to be misspecihed and a different model should be 
used for such components. The weights often exhibit diverging trajectories (see for example 
Figure [2]) which may be used to determine a suitable value for the parameter For example, 
the plots help us pick a value of ^ corresponding to a sufficient degree of separation between 
compatible and incompatible models. Eventually, ^ reaches an equilibrium point where the 
trajectories are maximally separated. After equilibrium, m —1 weights cluster together again 
as they tend to 0, where a single weight converges to 1. 

2.5 Selection of ^ 

The stability parameter ^ tunes the extent to which we down-weight incompatible models, 
which is important to discuss. One approach is to select the tuning constant ^ closest to 0 
(i.e., closest to uniform weights) such that the point estimates of the parameters of interest 
are sufficiently stable. If all the sub-likelihoods are compatible, = 0 already gives stable 
estimates and moving away ^ is expected to have little impact on the estimates. In the 
presence of incompatible sub-likelihoods, values of ^ close to 0 tend give unstable estimates 
in terms of bias and variance, so we move ^ away from 0 until stability is reached. For 
example in Figure [2] (right), the correlation estimator is far from the true correlation 
value of 0.5 when = 0. As moves away from zero, changes rapidly until stability is 
reached when ^ = 0.51. The above discussion suggests a simple data-driven procedure to 
select 

(1) Dehne an equally spaced grid 0 = .^o < ■Ci < 'C 2 < • • • < Cr < logm. 

(2) Starting from ^0 compute the correspondent point estimates, 9^., i = 0,... ,r. 
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(3) Select the optimal value using the stopping rule ^ = {min : ||6*g. — 9^._^ || < r}, where 
r > 0 is some threshold value. 

By dehnition, ^ is the value closest to 0 such that the variation of the point estimates is 
smaller than some acceptable threshold. Based on our simulations, a grid between = 0 
and = — log(l/2), with r = 5% x ||6'o|| typically works well and choices not too far from 0 
already give considerable stability. If a very small portion of data sub-sets are incompatible, 
it may be useful to consider rehnements of the grid near .^ = 0, such as, = {i/n) , 
i = 1, ...,r. 


3 Properties 


3.1 Large sample behavior of 6 ^ and standard errors 

To emphasize reliability aspects, it is helpful to distinguish between the true process gener¬ 
ating the data and the parametric model used for inference. Assume that X has distribution 
G{x), while the true distribution for the sub-vector Yj is denoted by Gj{yj). The den¬ 
sity function of Yj with respect to the dominating measure fi is denoted by gjiuj). Let 
{Fj{yj;9),9 G 0} be a parametric family of distributions for Yj and let fj{yj\9) denote the 
corresponding densities with respect to y. We assume that fj{yj\9) is identihable, i.e. for 
7 ^ 92, y[{Yj : fj(Yj\9i) 7 ^ fjiYj\92)}] > 0, for all j = !,■■■ ,m. 

The composite likelihood function ([2]) is correctly specihed if there is a parameter 9q E Q 
such that fj{yj\9o) = gj{yj) for all 1 < j < m; when no such 9q exists then ([2]) is misspecihed, 
meaning that it contains incompatible models. The optimal parameter, ^1, is dehned as the 
minimizer of the weighted composite Kullback-Leibler divergence 


9t = argmin Eg s log 


9iX) 


0G0 


UT=im^x\9y 


= argmin 
eee 


m 

i=i 


{ 0 ), 


( 10 ) 
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where 


e,{9) ^ -Eclnjie) = -Eg, {log f,{Y,\9)} 

is the cross-entropy between the true distribution Gj and the parametric sub-model fj{-\9) 
and Wj = Wj{9) = a2{9) exp{ai{9)£j{9)} (j = here denote asymptotic weights 

computed as in Section [2751 with inj{9) replaced by ij{9). In the remainder of the paper, we 
assume that is the unique maximizer of flTOl) . 

Next, consistency and asymptotic normality of 9^ are established. We note that stan¬ 
dard M-estimation theory cannot be applied directly to equation ([H]) because the weights 
{wnj{9),j = 1, • • ■ ,m} in (jlj) depend on random averages; thus some additional care is 
needed to characterize the asymptotic behavior of 9^. 


Proposition 3.1 Assume: (Cl) 9^ is an interior point in 0; (C2) supgg 0 \inj{9) —ij{9)\ 

0 as n —)■ oo (j = 1,... ,m); and ( C3) sup 0 g 0 £j{9) < oo (j = 1,... ,m). Then the maximum 
composite likelihood estimator 9^ converges in probability to defined in m- 

A direct consequence is Fisher-consistency of 9^, i.e. under correct composite likelihood 
specihcation the optimal target value is 9'( = 9q for all This can be seen by taking the 
expectation of equation ([ 8 ]) with 9 = 9q: 


Eg < y^^Wnj{9)unj{9) 




B=eo 



{9)EGmr,A9) 


Wn{9) 


= 0 , ( 11 ) 


0=00 


since EGjUnj{9o) = 0 if and only if Gj{-) = Ej{-\9o), for all 1 < j < m. This means that the 
estimating equation (IHl) is solved by 9o regardless of the choice of since changing the latter 
affects only the weights {wnj{9)}, but not the partial scores {unj{9)}. Section [3^ discusses 
bias in the presence of incompatible sub-likelihoods. 


Proposition 3.2 Under conditions (Cl) - (C3) in Pro'position \3. 1\ and additional regularity 
conditions given in the Appendix, y/n{9^ — converges in distribution to the p-variate 
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normal Np{0, ) as n —)■ oo, where and are the following p x p matrices 


[Hj{ei) + alE{ur.^{ei)}E{un,iei)f] , = Var I \ , (12) 

j=i [ j=i ) 

Hj{6) = E{VeUnjiO)}, w* = Wj{9’^) (j = l,...,m), al = ai(6*|) and expectations are with 
respect to G. 

The random weights, {wnj{d)}, play a crucial role in determining the asymptotic behavior 
of 6 ^. This feature is also found in model averaging, where parameter estimators obtained 
from different models, say fis £ S, are combined into a g lobal estimator ft, = 


through random weights Wns fjClaeskens and Hiort 


2008 


, Chapter 7). The connection with 


model averaging is further highlighted by the normal location example in Section 01 Here 
the random weights converge in probability to constants; thus, the asymptotic variance 
takes the u s ual sa ndwich form and H^, can be consistently estimated analogously to 


Varin et ah 


(120111) with weights Wnj{0^) (j = l,...,m), computed as in Section [2731 Re¬ 


sampling techniques such as jackknife and bootstrap may be also used. 


3.2 Bias under incompatible models 

In this section, we examine the hrst-order properties of our estimator in the presence of 
incompatible models. For clarity of exposition, in this section we consider the case where 
0 C , but analogous arguments can easily extended to the general case. To represent in¬ 
compatibility, we assume heterogeneous parameters for the first m* sub-models. Particularly, 
let gjiyj) = fiyfOj), 9j £ 0, (1 < j < m), where 9j, follows the drift model 9j = 9s = 9^ + 6, 
if j < m*, and 9j = 9o, if m* < j < m. In addition, we assume that the Fisher information 
Hji9) = Eg [d^ log/,(X; 9)/d9% j = 1,... , m, are bounded away from zero and inhnity. A 
first-order Taylor expansion of Unj and £nj in ([8]) about 9j under suitable regularity conditions 
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gives 


0 = ^exp{Q!i(%)4j(%)}Mnj(%) ^ ^exp{Q!i£j(%)} O^-Oo + Oq- 9j Hj(6 


i=i 


i=i 


Re-arranging the above expression leads to the following approximation for the bias 


% — 6*0 


m*6 exp{alii{9s)}Hi{9s) 


m* exp{alii{9s)}Hi{95) + {m - m*) exp{alini{9o)}Hi{9o) l + C{9o,6)' 


where 


C(« 0 . s) = - wm > - (^ - l) exp{a;(«,(e„) - 0(ft))}. 

m Hi[9s) C 2 Vm* / 

Therefore, an approximate npper bonnd to the bias, — 6*o|, is 

Max-Bias(0^|5) =- 

1 + ^ 

C2 

which is regarded as the worst-case bias nnder incompatible models. Clearly, when = 0 
(eqnivalently, a\ = 0), the worst-case bias grows linearly in 5. When > 0, Max-Bias(6*^|(5) 
is bonnded and the estimator 9^ achieves bias control. Particnlarly, if 5 = 0 and all the 
models are compatible, then Max-Bias(6*^|5) =0. If 5 is large, since the denominator in flT^ 
dominates the nnmerator, the maximal bias decreases qnickly to 0. 

A second-order Taylor expansion of Unj and Inj in dH]) abont 9j (not shown here) can 
be nsed to derive an npper bonnd for the mean sqnared error. Analogonsly to flT^ . when 
^ = 0 (eqnivalently, = 0), the worst-case mean sqnared error grows qnadratically in <5. 
When ^ > 0, the maximal mean sqnared error is bonnded, meaning that the estimator 9^ 
achieves both bias and variance control. This theoretical nnderstanding is conhrmed by the 
nnmerical simulations in Section 4. 

As an illustration. Figure [1] shows the maximal bias for the multivariate normal model 


|5| 


m ^ 
m* 


exp 




(13) 
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X ~ Nm{0,1) with 9j = 9o + 6 where h = 0 if j = 1,..., m*, and 9j = 9o, if m* < j < m. 
Clearly, the classic estimator with equal weights = 0) is very risky for this model, since the 
maximal bias can be potentially very large. This undesirable behavior can be easily avoided 
by setting .^ > 0. Thus, if the degree of incompatibility is strong ( |5| —)■ cxd), the worst-case 
bias approaches zero. For intermediate cases where |(5| < cxo the bias remains bounded and 
can be controlled by tuning 
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Figure 1: Worst-case bias for the multivariate normal location model X ~ Nio{9,I) with 
9j = 9o + 5 where 5 = 0, if 1 < j < m*, and 9j = 9o, if m* < j < 10 . Left: the 
curves correspond to different values of the constant a* described in Sections 13.11 and 13.21 
= 0,1,2,3,4, and m* = 1). Right; the curves correspond to increasing number of 
incompatible models, m*, ranging from 0 (horizontal solid line) to 4 = !)• 
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4 Examples 


4.1 Example 1: Estimation of correlation 

Suppose the random vector (Xi,X 2 , X 3 , X 4 , follows a multivariate normal distribution 
with zero mean vector, unit variances and covariances Cov{Xi, Xk) = if 2 < fc < 5, 

for some £ > 1, and Cov{Xj, Xk) = po otherwise. If we model X as a multivariate normal 
with zero mean vector and all correlations equal, then the model is clearly misspecified and 
the maximum likelihood estimator is not consistent for po- 

When constructing a composite likelihood function we only need pair-wise lower-dimensional 
likelihoods, since the marginal univariate sub-likelihoods do not contain information on po- 
Therefore the correlation estimator pg is obtained as described in Section [2] by maximizing 
the pairwise likelihood 

Lip\w) = '^Wjdnjkip) = |-|log(l -p 2 ) - ^ 

where SSjj = SSjk = J2i=i ^k'^ ■ Note that flTT|) refers to combining bi¬ 

variate normal models with zero mean and covariance given by 2 x 2 matrices with diagonal 
elements equal to 1 and off-diagonal elements equal to p. Therefore pg will be consistent for 
Po only if Wnu = ■■■ Wnw = 0. 

In Table 14.11 we show the hnite-sample bias and variance of the D-McLE for different 
values of As a comparison, we report results for the MLE and the usual McLE with 
uniform weights corresponds to the column with .^ = 0. When all the sub-likelihoods are 
compatible {e = 1), not surprisingly the MLE has the best performance in terms of variance. 
For the D-McLE, however, both bias and variance do not increase much as long as ^ is not 
too far from 0. In the presence of incompatible sub-models (e = 3, 5), the bias for the MLE 
and D-McLE with uniform weights = 0) is very large compared to the D-McLE with 
.^ > 0. For example, when e = 3, the bias of the D-McLE is negligible when ^ = 0.2. In 
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addition to bias control of D-McLE, we note also that our procedure also achieves variance 
reduction when ^ > 0 and n is small. These results suggest that by setting ^ slightly above 
zero (e.g., 0.1, 0.2, or 0.3) already gives substantial stability and reduce the mean squared 
error of the corresponding estimator, 9^. 


MLE D-McLE(0 


£ 


= 0.0 

0.1 

0.2 

0.3 

0.4 

0.5 

0.6 

0.7 

0.8 

0.9 

1 






Bias^ X 100 






1 

0.00 

0.00 

0.06 

0.14 

0.21 

0.27 

0.37 

0.43 

0.52 

0.62 

0.68 

0.78 

3 

2.43 

1.75 

0.12 

0.00 

0.05 

0.12 

0.20 

0.26 

0.36 

0.43 

0.50 

0.58 

5 

6.32 

4.53 

0.46 

0.00 

0.05 

0.11 

0.19 

0.27 

0.34 

0.42 

0.51 

0.57 






VarxlOO 






1 

0.10 

0.12 

0.12 

0.12 

0.13 

0.14 

0.13 

0.13 

0.14 

0.14 

0.14 

0.15 

3 

0.18 

0.20 

0.14 

0.13 

0.13 

0.14 

0.13 

0.14 

0.14 

0.15 

0.16 

0.16 

5 

0.18 

0.22 

0.15 

0.13 

0.13 

0.14 

0.14 

0.14 

0.15 

0.15 

0.16 

0.17 


Table 1: Bias and variance for pairwise likelihood estimation of the correlation model 
A^5(0, S) with unit variances and Cov{Xi, X^) = po/y/e if 2 < /c < 5, and Cov{Xj, X^) = po 
otherwise, with po = 1/2 and £ = 1,3,5(£ = 1 corresponds to the correctly specihed model). 
The columns refer to maximum likelihood estimator (MLE) and the discriminative composite 
likelihood estimator (D-McLE) with ^ ranging from 0 to 1 (^ = 0 implies uniform weights). 
Results are based on 10^ Monte Carlo samples of size n = 50. 


Figure [2] illustrates the prohle plot (left) and parameter estimates (right) for a sample 
of n = 50 observations. When = 0, the estimator is unreliable with estimates between 
Po/v^ = 0.5/\/5 0.22 and po = 0.5. When ^ moves away from zero, the importance 

prohle shows two distinct groups of sub-likelihoods, with the four overlapping paths at the 
bottom corresponding to misspecihed sub-likelihoods. When ^ = 0.51, the estimator pg 
exploits correctly the information from the compatible sub-likelihoods and gives estimates 
close to the true value po = 1/2. Finally, as log(lO), a single partial likelihoods tends to 
dominate the others, but much of the information from the other useful data pairs is ignored. 
Therefore the composite estimate at .^ = log(lO) is inferior to that at .^ = 0.51, in terms of 
accuracy. 
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Figure 2: Estimation of the correlation model iV5(0,S) with unit variances and 
Cov{Xi, Xk) = po/\/5 {2 < k < 5), and Cov{Xj, Xk) = Po {j ^ k ^ 1), with true pa¬ 
rameter pq = 1/2. Left; Importance prohle paths for the partial likelihood components 
based on the estimated weights, Wn^- Right; estimated correlation coefficient (horizontal is 
the true value po = 0.5). Illustration based on 50 observations. 


4.2 Example 2: Location of heterogeneous normal variates 

Let (Xi,...,Xm) be independent normal variables with common mean E{Xj) = po {1 < 
j < m) and heterogeneous variances Var{Xj) = j (1 < j < m). This is the basic 
meta-analysis model where a weighted average of a series of study estimates, say {Xj}, is 
combined to obtain a more precise estimate for po- The inverse of the estimates’ variance, 
is the optimal study weight ensuring minimum variance of the combined estimate. All 
the parameter information is contained in the marginal models, so the following negative 
one-wise composite likelihood function is minimized; 

m r ^ n _ \2 1 

-24(p,o-i,.. .,am\w) = logo-| -h -> . (15) 

j=l L i=l t ) 
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and the profiled composite likelihood estimators are 



m m n ^ 


j=l j=l i=l 1=1 


Replacing fi = fiiw) and aj = d'j{w) in (lT5ll gives Xljli '^j which is then minimized 

snbject to the constraints Dkl{w) = ^ and 

The resnlting location estimator, say solves the hxed-point eqnation 



( 16 ) 


where di > 0 is compnted as in (|6]) for a given .^ > 0, and the variance estimators are 



The degree of incompatibility of models is very strong, then the estimator is nearly 
as good as the estimator obtained by ignoring the corresponding data snb-sets. If all the 
models are compatible, still performs well in terms of accnracy. Particnlarly, if all the 
partial likelihoods are correctly specihed, then = /xq- If Xj {1 < j < m*) are far 

away from po, then hnding is approximately eqnivalent to solving ffTbjl with Wnjifi) = 0 , 
if j < m*. 

Table 14.21 shows bias and variance for nnder correctly specihed and misspecihed snb- 
likelihoods. The nsnal McLE with nniform weights corresponds to the colnmn with ^ = 0. 
For comparison pnrposes, we also show the maximnm likelihood estimator with weights 
Wmiej cc ^/Sj, where Sj is the sample standard deviation for the jth variable. The re- 
snlts correspond to the location model with aoj = 1 /j (j = 1 ,..., 10 ) and misspecihcation 
introdnced by the location shift fij = fio + 1, j = 1,2. When all the snb-likelihoods are 
compatible (m* = 0), the MLE has the best performance, bnt the D-McLE with ^ = 0.1 
doing comparably well. In the presence of two incompatible snb-models (m* = 2), the bias 
for the MLE and D-McLE with nniform weights (.^ = 0) is large compared to the D-McLE 
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with ^ > 0. The bias is quite small when ^ = 0.3. The variance of D-McLE for 0 < e < 0.3 
is also quite small compared to the McLE with uniform weights; interestingly in a few cases 
the variance is smaller than that of the MLE. This confirms the behavior observed in other 
numerical examples and in the derivations given in Section 13.21 Across a number of other 
simulation settings, we found that ^ slightly larger than zero gives estimators with negligible 
bias and relatively small mean squared errors. 


n 

m* 

MLE 

e = 

0.0 

0.1 

0.2 

D-McLE(0 
0.3 0.4 

0.5 

0.6 

0.7 

10 

0 

0.00 

0.01 

0.00 

Bias^ 

0.01 

X 1000 

0.07 

0.14 

0.19 

0.21 

0.22 


2 

1.35 

36.32 

1.27 

0.16 

0.09 

0.22 

0.26 

0.59 

1.79 

100 

0 

0.00 

0.00 

0.00 

0.00 

0.00 

0.00 

0.00 

0.00 

0.00 


2 

2.53 

39.47 

1.14 

0.21 

0.04 

0.00 

0.00 

0.00 

0.01 

10 

0 

3.51 

5.03 

4.08 

Varx 1000 
6.69 9.55 

11.01 

11.89 

12.49 

12.90 


2 

9.50 

6.66 

5.06 

6.93 

10.23 

15.14 

17.58 

18.70 

21.67 

100 

0 

0.41 

0.65 

0.43 

0.47 

0.53 

0.59 

0.65 

0.71 

0.76 


2 

0.53 

0.73 

0.48 

0.51 

0.53 

0.55 

0.57 

0.59 

0.61 

Table 2: 

Bias 

and variance 

for 

ocation 

estimates of X ~ 

0 

0 

^0), 

where E 


diag(l, 1/2,..., 1/10), with and without incompatible models (m* = 0, 2, respectively). The 
columns correspond to the maximum likelihood estimator (MLE) with weights proportional 
to {1/51}, where 5| is the sample standard deviation for the jth variable, and the composite 
likelihood estimator with ^ between 0 and 0.7 (D-McLE). For m* = 2, misspecification is 
introduced as pj = fio + 1, j = 1,2. Results based on 10"^ Monte Carlo samples of sizes 
n = 10,100. 


5 Multivariate models for spatial extremes: applica¬ 
tion to the Tasmanian rainfall data 


Max-stable processes have emerged as a useful re presentation of extrem e environmental 


occurrences such as hurricanes, floods and storms flDavison et ah 


2 OI 2 I) . However, their 


estimation poses significant challenges, since they lack of a general multivariate density 
expression. A well studied case is the Gaussian max-stable process dehned as Z{s) = 
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ma.Xi>i{Vif{Ui — s)}, where {Vi,Ui} is a Poisson process on (0, oo] x with intensity 
measure uj ds) x u~‘^du, and / is the bivariate normal distribution with zero mean and co- 


variance S (jSmithl . 


19901). T 


re process Z has unit Frechet margins with distribution function 


F{z) = exp(-l/; 2 ), z > 0. 


Smith! fjl990[) interprets Z as extreme environmental episodes, 


such as storms, where V, U, and / are the storm magnitude, center, and shape, respectively. 

Next, we apply the D-McLE to estimate the extreme covariance parameter S in the 
context of the Tasmania rainfall data described below. For a hnite set of spatially-referenced 


indexes, si,..., G M^, the joint c 
analytical representation for d > 2. 


istribution of t 


Padoan et ah 


le ra ndom vector Z(si ),..., Z{sd) has no 


fl2010l) give a closed-form expression for the 


bivariate density and propose estimation based on the pairwise likelihood function. Given 
n observations on d locations, z'f '^..., z^^\ {i = 1,..., n), the weighted pairwise likelihood 
function obtained by considering all m{m — l)/2 location pairs is 


m—1 m 




j=l k=j+l i=l 


where fz z^ is the bivariate density 


fziZ^{zj,Zk\'Z) = exp 




Zk 


X 


92{h)^{gi{h)} gi{h)^{g2{h)} 


a{hyx]zk 


aihfzjxl 


^{9i{h)} ^{9i{h)} ^{92{li)} 

O “T 


Xz 


a{hyx'j a{hyzjZk 


H92ih)} ^p{92ih)} (p{9iih)} 

o “T 


xt 


a{hyxl a{hyzjZk 


(17) 


In the above expression, <F and (p are the standard normal probability and density functions, 
respectively; h = [sj — Sk), a{h) = gi{h) = a{h)/2 + \og{xj/Zk)/a{h); and 

92 {h) = a{h) — gi{h). For hxed h, the extremal dependence behaviour is determined by S, 
which is therefore the main interest for inference. Since the above model requires unit Frechet 
margins, the observed margins, |/j, are transformed in unit Frechet by the transformation 
Vj ~ 9jiyj) = [i- + CjiVj ~ hil/bil + i where = max(0,M) and /ij, 'yj and Q are location, 
scale and shape parameters obtained from the empirical distribution. 
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We consider a data set of 20 yearly rainfall maxima recorded at 10 gauging stations from 
1995 to 2014 in the Australian state of Tasmania corresponding to the following locations, 
also shown in Figure E) Bushy Park, Ross, King Island, Eddistone Point, Geeveston, Strahan, 
Flinders Island, Marrawah, Rocky Point, Orford (source: http://wwwc.bom.gov.au/tas/). 
The max-stable Gaussian model is then htted using a pair-wise likelihood function including 
all m = ( 2 °) = 45 pairs of locations. We compute estimates for different choices of ^ 
ranging from 0 to log(45). Figure [3] (left) shows the a map of Tasmania with the 10 stations 
locations, and the edges denote fitted weights, Wnj corresponding to ^ = 0.3 (dashed lines 
represent weights smaller than the hrst quartile of htted weights). Figure [3] (right) shows 
GPP plots for the weights. We note that pairwise likelihoods involving the King Island 
station (located at coordinates -39.88 , 143.88 on the map) exhibit a very weak degree of 
compatibility compared to locations in the southern and eastern part of the island. This 
suggest a different pattern for the precipitations for King Island in relation to the rest of the 
stations; thus pair-wise sub-models involving such a station should be further inspected and 
possibly revised. 

Figure m shows estimated parameters dn, and (T 22 for values of ^ ranging from 0 to 
1; the vertical bands represent 95% conhdence intervals. For values of ^ larger than 0.3, the 
interval estimates appear quite stable. This can be seen by looking at the relative change in 
parameter estimate and also width of the confidence intervals. We can see that the estimated 
extremal correlation, p, is notably affected by the measurements in a single station (King 
Island). As the sub-likelihoods involving that particular station receive increasingly low 
weights, the estimates change substantially. This behavior is consistent with that observed 
in our simulated data. To compare fitted models we als o considered t h e com posite likelihood 


information criterion for model selection discussed in 

r-l 


Padoanetah 


(20101 ) and defined by 


CLIC{^) = —2in{0^) + where and are estimates of the matrices 

and defined in Section IRTI We found that the CLIC{^) decreases monotonically for 
^ in [0,1] - particularly we have we have CLIC{0) = 156.6 and CLIC{0.3) = 155.9. This 
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Figure 3: Left: Tasmania elevation map with location of the weather gauging stations. The 
dashed edges denote fitted weights Wnj (computed as described in Section I2.3p smaller than 
the hrst quartile for the weights (~ 0.0065) when ^ = 0.3. Right: compatibility profile plots 
for ^ between 0 and 1. 

suggests that ^ > 0 should be preferred to the usual composite likelihood estimator with 
uniform weights with = 0. 

6 Conclusion and final remarks 

This work introduces the D-McLE, a new estimator obtained by maximizing the weighted 
composite likelihood function subject to a discrimination constraint, which entails moving 
away by a distance ^ from uniform weights. The D-McLE has appealing features from 
both theoretical and practical viewpoints. First, we found that the data-adaptive weights 
render the parameter estimates more stable in the presence of incompatible models compared 
to classic composite likelihood approaches with fixed weights. This is clearly seen from our 
asymptotic derivations and our numerical simulations confirm this behavior in finite samples. 
Second, the estimated weights, which are a by-product of our procedure, can be used to rank 
the compatibility of lower-dimensional likelihoods and are a useful diagnostic tool for model 
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<D 



<0 



^ ^ ^ 

Figure 4: Estimation of the Gaussian max-stable model for the Tasmania rainfall data. 
Estimates of ctu, cru, (^22 for ^ ranging from 0 to 1. Vertical bars represent 95% conhdence 
intervals based on standard errors from the asymptotic distribution of the D-McLE. 


selection. For example, if the jth data sub-set receives an unusually small weight, it is likely 
that the corresponding model, is incompatible. Targeted analyses on the anomalous 

data sub-sets can lead to improved model assumptions. Third, our approach leads naturally 
to the algorithm in Section [231 which we found to be quite fast and and easy to implement. 

In recent years, high-dimensional estimation has become a core area of multivariate anal¬ 
ysis. We believe that the D-McLE will be valuable as a remedy to common shortcomings of 
the classic McLE with hxed weights and MLE when the sample size, n, is relatively small 
compared to the complexity of the full model. Specihcally, the constrained optimization 
problem ([5]) is a type of regularization approach where XiDKiiw^Wunif) can be regarded as 
complexity penalty which promotes sparsity and produces vectors w with many elements 
close to zero. Regularizatio n approaches have proved useful for high-dimensional model se¬ 


lection flFan and Lv . 


20101). Similarly, in this context, we believe that the design of new 


sparsity-inducing penalty schemes for likelihood selection would be an interesting direction 
for further exploration and is high priority in our research agenda. Findings would be partic¬ 
ularly valuable to spatial statistics and statistical genetics, where often the large number of 
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sub-likelihood components poses serious challenges to applicability of composite likelihood 
methods. 


Up to date, not many papers have expl ored the large-m beh avior of composite likelihood 


estimators from a theoretical perspective. 


Cox and ReidI (120041) provide useful explanations 


on the asymptotic behavior of the pairwise composite likelihood estimator as m —)■ cxo and 
n is hxed; particularly, they discuss how the presence of strongly correlated partial scores 
affects the usual convergence rate of McLE. Additional Monte Carlo experiments for the 
normal location model dehned in Section 14.21 (not reported here) show that in hnite sam¬ 
ples the D-McLE can reduce considerably the mean squared error of the uniformly weighted 
McLE - even under fully compatible models. Such accuracy gains are relatively large when 
m increases. Developing theoretical insight on this phenomenon - and particularly on the 
interplay between the type of regularization constraint and the mean squared error of the 
resulting estimator as m increases - would represent another exciting future research direc¬ 
tion. 


Appendix 

Lemma 1. If DKL{wn{0),Wunif) = f>0, then Veln{0) = YljLiWnj{0)unj{d) ■ Therefore, 

^eIn{G) = 0 implies Veai{9) = 0 with probability going to 1. 


Proof of Lemma 1. Let Q!((6') = Veo:i{9). Differentiating both sides of DxL(wn(6'), w«m/) = f 
gives 




ET=i + a^{9)u,{9)} 


where in{9) = YlT=i''^rij{9)inj{9)■ This implies V0in{9) = ^fLi'^nj(9)unj(9). A calculation 
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also shows 


'^e^nid) = a[{9) '^WnjiO){injiO) - in{d)y + Wnj{0)Unj{0), 


(18) 


i=i 


i=i 


Since the hrst sum in flTS]) is strictly positive with probability one as n —)■ cxo and the second 
sum equals zero by the Kullback-Leibler divergence constraint, we have that Vgai{9) = 0 
with probability one as n —)■ oo. 

Proof of Proposition 1 

The main goal is to show uniform convergence for the composite likelihood function £n{9). 
In particular, 


eee 


sup \£n{9) - e{9)\ < s\ip'^\wnji9)inj{9) - Wj{9)ij{9)\ (19) 

m m 

< ^sup \inj{9) - ij{9)\ + ^sup \ij{9)\ \Wnj{9) - Wj{9)\ (20) 


i=i 


eee 


i=i 


eee 


The hrst term in fl20|) converges to zero in probability by Condition C2. By the continuous 
mapping theorem, also the second term converges to zero. Next, note that in{9^) > in{9^) = 
£(0|) — Op(l), where the last equality follows from the weak law of large numbers, since the 
latter implies £nj{9’^) ^j(^|) (1 < J < "i), and the continuous mapping theorem. Hence 


— £{9^) < £n(9^) — ^(%) + Op(l) < sup \£n{9) — £{9) \ + Op(l) —)■ 0, (21) 


eee 


by Condition C2. Since the optimal parameter value is unique, (1^ implies 9^ A- 9t. 


Regularity conditions and proof of Proposition 2 

Let V denote the differential operator with respect to the parameter vector 0 G 0 C 
Un{9) = J2]Li'^nj{9)unj{9) dcuotes the weighted score p-vector, with partial scores Unj{9) = 
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n~^ Sr=i ^ log/j(|/j*^|6'), and are p x p matrices defined in the asymptotic variance 

expression (12). Assume (C1)-(C3) given in Proposition 1 and the additional regularity 
conditions: 

(C4) the sub-model fj{yj\0) is three times differentiable in 0, 1 < j < m; 

(C5) maxi<fc<m i?G|'WnA:(6')|^ is upper bounded by a constant; 

(C6) the smallest eigenvalue of is bounded away from zero; 

(C7) the elements of the matrix are upper bounded by a constant; 

(C8) the expectation of second-order partial derivatives of Unk{(^) with respect to G 
are upper bounded by a constant for all 0 in a neighborhood of 

By Taylor’s Theorem, there exists a random point 6 between 6'^ and 6^ such that 

0 = Ur,{e^) = Ur,{9l) + Vur,{9l){9^ - 91) + - 9l)^V‘^Ur,{9){9^ - 91). ( 22 ) 

For the first term Un{9^) = Wnj{9^)unj{9'^) in the above expansion, the central limit the¬ 

orem implies that ^/n Unj{9*^) converges weakly to a p-variate normal distribution with mean 
p* = EGUnj{9*^) and p x p covariance matrix V* = —Hj~^{9’^), for all j = 1,... ,m, where 
Hj{9) = EGVunj{9). Since £nj(9^) A ij{9'^) (j = 1,... ,m), the continuous mapping theo¬ 
rem implies that Wnj{9^) converges in probability to constants w* = Wj{9*^) (j = 1,... ,m). 
Therefore, by Slutsky’s theorem we have convergence in distribution of y/n Un{9^) to the 
normal mixture 

m 

V^un{9*^) AJ2y^,{9l)Np{p*,V*}. 
i=i 
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such that = 0- For VMn(6*|) in the second term of expansion ([22]), Lemma 1 

gives 

m 

Vur^iOD = Y^Wniioi) [Vu^jiei) + a,{ei)u^^{ei)unj{eif + {Va^{ei)}u^^{eifL^{ei)] 

m 

j=i 

where ai{9) is the solution of equation (6) and a\ = ai(6*|) denotes the solution of equation 
(6) with inj replaced by ij and 9 = 9'^. Convergence in probability follows from the contin¬ 
uous mapping theorem since inj{9'^) ^ Unj{9*^) A fi*, Vunj{9*^) A Hj{9^). Finally, 

for the third term of the expansion fl22|) by assumption, there is a neighborhood B of 
and a constant k for which each entry of the array EGV‘^Unk{9) < k for all 0 G -B and all 
/c = 1,... ,p. Therefore, ||V^-Unfc(6')|| is bounded in probability by the law of large numbers. 
By Proposition 1, A and the third term in the expansion fl2^ is of higher order than 
the second term, so the normality result follows by applying Slutsky’s Lemma. 
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